Abstract

In this work we discuss the problem of selecting suitable approximators from 
families of parameterized elementary functions that are known to be dense in 
a Hilbert space of functions. We consider and analyze published procedures, 
both randomized and deterministic, for selecting elements from these families 
that have been shown to ensure the rate of convergence in $L_2$ norm of order
$O(1/N)$, where $N$ is the number of elements. We show that both randomized
and deterministic procedures are successful if additional information about the 
families of functions to be approximated is provided. In the absence of such 
additional information one may observe exponential growth of the number of 
terms needed to approximate the function and/or extreme sensitivity of the 
outcome of the approximation to parameters. Implications of our analysis for 
applications of neural networks in modeling and control are illustrated with 
examples. 

Keywords: Random bases, measure concentration, neural networks, 
approximation 


1. Introduction 

The problem of efficient representation and modeling of data is important in 
many areas of science and engineering. A typical problem in this area involves 
constructing quantitative models (maps) of the type 

$$x_1, x_2, \dots, x_d \mapsto f(x_1, x_2, \dots, x_d),$$

where $x_1, x_2, \dots, x_d$, $x_i \in \mathbb{R}$, $i = 1, \dots, d$ are variables and
$f : \mathbb{R}^d \to \mathbb{R}$ is an unknown functional relation between the variables.
The total number, $d$, of variables determining input data may be large, and
physical models of such relations $f(\cdot)$ are not always available.

In the absence of acceptably detailed prior knowledge of “true” models,
$f(\cdot)$, a commonly used alternative is to express the function $f(\cdot)$ as a linear
combination of known functions, $\varphi_i : \mathbb{R}^d \to \mathbb{R}$:

$$f(x) \approx f_N(x) = \sum_{i=1}^{N} c_i \varphi_i(x), \quad c_i \in \mathbb{R}. \tag{1}$$



Numerous classes of functions in (1) have been proposed and analysed to
date, starting from $\sin(\cdot)$, $\cos(\cdot)$ and polynomial functions featured in classical
Fourier, Fejér, and Weierstrass results, wavelets [37], [HI], and reaching out to
linear combinations of sigmoids [7]
and other general functions [10] that are often used in neural network literature
141 ]. Sometimes the values of parameters $w_i$, $b_i$ may be pre-selected on the basis
of additional prior knowledge, leaving only linear weights $c_i$ for training. If no
prior information is available then both nonlinear ($w_i$ and $b_i$) and linear ($c_i$)
parameters, or weights, are typically subject to training on data specific to the
problem at hand (full network training).

One special feature that makes full training of approximators (2), (3)
particularly attractive is that, in addition to their universal approximation
capabilities 0: E3> 271 and their homogeneous structure, they are reported to be
efficient when the dimension $d$ of input data is relatively high. In particular,
if all parameters $w_i$, $c_i$, $b_i$ are allowed to vary, the order of convergence rate
of the approximation error of a sufficiently smooth function $f(\cdot) \in C^0[0, 1]^d$
as a function of $N$ (the number of elements in the network) is shown to be
independent of the input dimension $d$ [19], [5]. Furthermore, the achievable rate
of convergence of the $L_2$-norm of $f(\cdot) - f_N(\cdot)$ is shown to be $O(1/N^{1/2})$. This
contrasts sharply with the rate $O(d^{-1}/N^{1/d})$ corresponding to the worst-case
estimate inherent to linear combinations (1) with $\varphi_i(\cdot)$ given or (2), (3) with
$w_i$, $b_i$ fixed. In particular, it is shown in Q that if only linear parameters
of (2), (3) are adjusted the approximation error cannot be made smaller than
$C d^{-1}/N^{1/d}$, where $C$ is independent of $N$, uniformly for functions satisfying
the same smoothness constraints.

Favorable independence of the order of convergence rates on the input dimension
of the function to be approximated, however, comes at a price. Construction
of such efficient models of data involves a nonlinear optimization routine
searching for the best possible values of $w_i$, $b_i$. The necessity to adjust
parameters entering (2), (3) nonlinearly restricts practical application of these
models in a range of relevant problems (see e.g., [0, [Zcj, [38], 251 for an
overview of potential issues).

An alternative to adjustment of nonlinear parameters of (2), (3) has been
proposed in [iffij ]. Eii] . [18] and further developed in [30], Ef (see also
earlier work by F. Rosenblatt El in which he discussed perceptrons with random
weights). In these works the nonlinear parameters $w_i$ and $b_i$ are proposed to
be set randomly at the initialization, rather than through training. The only
trainable parameters are those which enter the network equation linearly, $c_i$.
Random selection of weights in the hidden layers is supposed to generate a set
of (basis) functions that is sufficiently rich to approximate a given function by
mere linear combinations from this set. This crucially simplifies training in
systems featuring such approximators, and turns an otherwise computationally
complex nonlinear optimization problem into a much simpler linear one.
Moreover, the rate of error convergence is argued to be $O(1/N^{1/2})$ 0 , a, EL.


This brings us to a paradoxical contradiction: on the one hand the reported
convergence rate $O(1/N^{1/2})$ which a randomized approximator is supposed to
achieve contradicts the earlier worst-case estimate $O(d^{-1}/N^{1/d})$ obtained in
0. On the other hand numerous studies report successful application of such
randomization ideas (see also comments [0 ) in a variety of intelligent control
applications 0 ], EL EL EL 0 as well as in machine learning (see e.g. (H,
[41] and references therein). The question, therefore, is whether it is possible
to resolve such an apparent controversy. Answering this question is the main
purpose of this contribution.

The paper is organized as follows. We begin the analysis with Section 3 in
which we review basic reasoning in and compare these results with [19],
Q. We show that, although these results may seem inconsistent, they have
been derived for different performance criteria. The worst-case estimate in Q
is “crisp”, whereas the convergence rate in 0] is probabilistic. That is, it
involves a measure with respect to which the rate $O(1/N^{1/2})$ is assured.
Introduction of measure into the problem brings out a range of interesting
consequences that are discussed and analyzed in Section 4. There we approach the
data approximation problem as that of representation of a given vector by a linear
combination of randomly chosen vectors in high dimensions. A simple logic
suggests that if one takes $m < n$ random vectors in $\mathbb{R}^n$ then it can be expected
that with probability one these vectors will be linearly independent because
the set of linearly dependent tuples $\{x_1, \dots, x_m\}$ ($m < n$) is a proper
algebraic subset in the space of tuples $(\mathbb{R}^n)^m$. We can select $n$ random vectors,
and with probability one they will form a basis. Every given data vector $y$ can be
represented by coordinates in this basis. If these $n$ vectors are (accidentally)
too close to dependence we can generate a few more vectors that will enable us
to represent the data vector $y$. We will show, however, that this simple and
correct reasoning loses its credibility in high dimensions. We show that in an
$n$-dimensional unit cube $[-1, 1]^n$ a randomly generated vector $x$ will be almost
orthogonal to a given data vector $y$ (the angle between $x$ and $y$ will be close to
$\pi/2$ with probability close to one). To compensate for this waist concentration
effect [13], [12], [11] one needs to generate exponentially many random vectors.
Typicality of such exponential growth is an inherent feature of high-dimensional
data representation, including problems of data approximation and modelling.
Moreover, for high-dimensional data representation the following two seemingly 
contradictory situations are typical in some sense: 


• with probability close to one linear combinations of $n - k$ random vectors
approximate any normalized vector with accuracy $\varepsilon$ if $k \ll n$ and no
constraints on the values of coefficients in linear combinations are imposed
(and this probability is one, if $k = 0$);

• with probability close to one an exponentially large number $N$ of random
vectors are pairwise almost orthogonal and do not span an arbitrarily
selected normalized vector if coefficients in linear combinations are not
allowed to be arbitrarily large.


The approach to assess effective dimensionality of spaces based on
$\varepsilon$-quasiorthogonality was introduced in 0, 0. The authors demonstrated that in high
dimensions there exist exponentially large sets of quasiorthogonal vectors. This
observation has been exploited in the random indexing literature [34]. Here we
show that not only do such sets exist, but also that they are typical in some sense.
Implications of our analysis are illustrated in Section 5 followed by a few examples
presented in Section 6. The final section concludes the paper.


2. Notation 

Throughout the paper the following notational agreements are used:

• $\mathbb{R}$ denotes the field of real numbers;

• $\mathbb{R}^n$ stands for the $n$-dimensional linear space over the field of reals;

• let $x \in \mathbb{R}^n$, then $\|x\|$ is the Euclidean norm of $x$: $\|x\| = \sqrt{x_1^2 + \cdots + x_n^2}$;

• if $x, y$ are two non-zero vectors from $\mathbb{R}^n$ then $\angle(x, y)$ denotes the angle
between these vectors;

• symbol $|\cdot|_{\mathbb{R}^n}$ is reserved to denote an arbitrary norm in $\mathbb{R}^n$;

• $S^{n-1}(R)$ denotes an $(n-1)$-sphere of radius $R$ centred at 0:
$S^{n-1}(R) = \{x \in \mathbb{R}^n \mid \|x\| = R\}$;

• $\mu$ is the normalized Lebesgue measure on $S^{m-1}(1)$: $\mu(S^{m-1}(1)) = 1$;

• $B_n(R)$ denotes an $n$-ball of radius $R$ centered at 0:
$B_n(R) = \{x \in \mathbb{R}^n \mid \|x\| \le R\}$;

• $V(\Xi)$ is the Lebesgue volume of $\Xi \subset \mathbb{R}^n$;

• $\mathcal{D}^{n-1}(R)$ stands for an $(n-1)$-disc in the $n$-ball $B_n(R)$ corresponding to its
largest equator, and $\mathcal{D}^{n-1}_\delta(R)$ is its $\delta$-thickening;

• let $f : [0, 1]^d \to \mathbb{R}$ be a continuous function, then

$$\|f\|^2 = \langle f, f \rangle = \int_{[0,1]^d} f(x) f(x)\,dx,$$

where $\|f\|$ denotes the $L_2$-norm of $f$.


3. Preliminary analysis and motivation 


In this section we review and compare the so-called greedy approximation
upon which the famous Barron’s construction is based [4] and the Random
Vector Functional-Link (RVFL) network [18] in which the basis functions are
randomly chosen, and only their linear parameters are optimized. As we show
below, although the error convergence rates appear to be of the same order, they
are in essence dramatically different. One rate is crisp in the sense that it is
a worst-case estimate for all functions from a given class. The other one is
probabilistic in nature, and hence holds in probability. The consequences of the
latter are further investigated in Section 4.

Consider the following class of problems. Let $g : \mathbb{R} \to \mathbb{R}$ be a
function such that

$$\|g\| \le M, \quad M \in \mathbb{R}_{>0},$$

and let

$$\mathcal{G} = \{g(w^T x + b)\}, \quad w \in \mathbb{R}^d, \ b \in \mathbb{R},$$

be a family of parameterized functions $g(\cdot)$. Let $f \in C^0([0, 1]^d)$, and let $f$ belong
to the convex hull of $\mathcal{G}$. In other words, there is a sequence of $w_i$, $b_i$, and $c_i$
such that

$$f(x) = \sum_{i=1}^{\infty} c_i g(w_i^T x + b_i), \quad \sum_{i=1}^{\infty} c_i = 1, \quad c_i \ge 0.$$

We are interested in approximating $f$ by the finite sum

$$f_N(x) = \sum_{i=1}^{N} c_i g(w_i^T x + b_i). \tag{4}$$

It is important to know how approximation errors, defined in some meaningful
sense, may decay with $N$, and what the computational costs for achieving this
rate of decay are.

3.1. Greedy approximation and Jones Lemma

In order to answer the question above one needs first to determine the error
of approximation. It is natural for functions from $L_2$ to define the approximation
error as follows:

$$e_N = \|f_N - f\|. \tag{5}$$

The classical Jones iteration 0 (refined later by Barron 0 ) provides us
with the following estimate of achievable convergence rate:

$$e_N^2 \le \frac{M''^2 e_0^2}{N e_0^2 + M''^2}, \quad M'' > M' > \sup_{g \in \mathcal{G}} \|g\| + \|f\|.$$

The rate of convergence depends on $d$ only through the $L_2$-norms of $f_0$, $g$,
$f$. The iteration itself is deterministic and can be described as follows:

$$f_{N+1}(x) = (1 - \alpha_N) f_N(x) + \alpha_N g(w_N^T x + b_N), \tag{6}$$

and

$$\alpha_N = \frac{e_N^2}{M''^2 + e_N^2}, \quad M'' > M', \tag{7}$$



where parameters $w_N$, $b_N$ of $g$ are chosen such that the following condition holds:

$$\langle f_N - f, \ g(w_N^T \cdot + b_N) - f \rangle < \frac{((M'')^2 - (M')^2)\, e_N^2}{2 (M'')^2}. \tag{8}$$

This choice is always possible (see 01 for details) as long as the function $f$ is
in the convex hull of $\mathcal{G}$.

According to (6) the rate of convergence of such approximators is estimated
as

$$e_N^2 = O(1/N).$$

This convergence estimate is guaranteed because it is the upper bound for the
approximation error at the $N$th step of iteration (6).
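
The mechanism behind the $O(1/N)$ rate can be checked numerically. The sketch below assumes the step size $\alpha_N = e_N^2/(M''^2 + e_N^2)$ and condition (8), which together give the one-step worst case $e_{N+1}^2 = M''^2 e_N^2/(M''^2 + e_N^2)$; it iterates this recursion and compares it with the closed form $M''^2 e_0^2/(N e_0^2 + M''^2)$. The function names and the sample values $M''^2 = 4$, $e_0^2 = 1$ are illustrative assumptions, not part of the cited results.

```python
def worst_case_errors(e0_sq, m_sq, steps):
    """Iterate e_{N+1}^2 = M''^2 e_N^2 / (M''^2 + e_N^2), the worst case of one greedy step."""
    errors = [e0_sq]
    for _ in range(steps):
        e = errors[-1]
        errors.append(m_sq * e / (m_sq + e))
    return errors


def closed_form(e0_sq, m_sq, n):
    """Closed-form value M''^2 e_0^2 / (N e_0^2 + M''^2) of the same recursion."""
    return m_sq * e0_sq / (n * e0_sq + m_sq)


if __name__ == "__main__":
    errs = worst_case_errors(e0_sq=1.0, m_sq=4.0, steps=1000)
    for n in (1, 10, 100, 1000):
        # Both columns agree, and N * e_N^2 stays below M''^2.
        print(n, errs[n], closed_form(1.0, 4.0, n))
```

Since $1/e_{N+1}^2 = 1/e_N^2 + 1/M''^2$ along this recursion, the squared error decays like $M''^2/N$ regardless of the input dimension.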


3.2. Approximation by linear combinations of functions with randomly chosen 
parameters 

We now turn our attention to the result in [18]. In this approximator the
original function $f(\cdot)$ is assumed to have the following integral representation²

$$f(x) = \lim_{\alpha \to \infty} \lim_{\Omega \to \infty} \int_{W^d} F_{\alpha,\Omega}(\omega)\, g(\alpha w^T x + b)\,d\omega, \tag{9}$$

where $g : \mathbb{R} \to \mathbb{R}$ is a non-trivial function from $L_2$:

$$0 < \int_{\mathbb{R}} g^2(s)\,ds < \infty, \tag{10}$$


² We keep the original notation of [18] which uses both $\omega$ and $w$ for the sake of consistency.
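where $\omega = (y, w, u) \in \mathbb{R}^d \times I^d \times [-\Omega, \Omega]$, $\Omega \in \mathbb{R}_{>0}$,
$W^d = [-2\Omega; 2\Omega] \times I^d \times V^d$, $V^d = [0; \Omega] \times [-\Omega; \Omega]^{d-1}$,
$b = -(\alpha w^T y + u)$ and

$$F_{\alpha,\Omega}(\omega) = \frac{\alpha \prod_{i=1}^{d} w_i}{\Omega^d 2^{d-1}} f(y) \cos_\Omega(u), \quad
\cos_\Omega(u) = \begin{cases} \cos(u), & u \in [-\Omega, \Omega], \\ 0, & u \notin [-\Omega, \Omega] \end{cases}$$
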

(see [18], [23] for a more detailed description). Function $g(\cdot)$ induces a
parameterized basis. Indeed, if we were to take integral (9) in quadratures for
sufficiently large values of $\alpha$ and $\Omega$, we would then express $f(x)$ by the
following sums of parameterized $g(\alpha w^T x + b)$ [18]:

$$f(x) \approx \sum_{i} c_i g(\alpha w_i^T x + b_i), \quad b_i = -(\alpha w_i^T y_i + u_i). \tag{11}$$

The summation in (11) is taken over points $\omega_i$ in $W^d$, and $c_i$ are weighting
coefficients. Variables $\alpha$ in (11) and $\alpha_N$ in (6) play different roles in each
approximation scheme. In (6) the value of $\alpha_N$ is set to ensure that the
approximation error is decreasing with every iteration, and in (11) it stands for
a scaling factor of random sampling.

The main idea of [18] is to approximate integral representation (9) of $f(x)$
using the Monte-Carlo integration method as

$$f(x) \approx \frac{4\Omega}{N} \lim_{\alpha \to \infty} \lim_{\Omega \to \infty} \sum_{k=1}^{N} F_{\alpha,\Omega}(\omega_k)\, g(\alpha w_k^T x + b_k)
= \lim_{\alpha \to \infty} \lim_{\Omega \to \infty} \sum_{k=1}^{N} c_{k,\Omega}(\alpha, \omega_k)\, g(\alpha w_k^T x + b_k)
= f_{N,\omega,\Omega}(x), \tag{12}$$

where the coefficients $c_{k,\Omega}(\alpha, \omega_k)$ are defined as

$$c_{k,\Omega}(\alpha, \omega_k) = \frac{4\Omega}{N} F_{\alpha,\Omega}(\omega_k) \tag{13}$$

and $\omega_k = (y_k, w_k, u_k)$ are randomly sampled in $W^d$ (domain of parameters, i.e.,
weights and biases of the network).

As the number of samples, $N$, i.e., the network size, grows, the quantity

$$E_\omega(N) = \mathrm{E} \int_K |f(x) - f_{N,\omega,\Omega}(x)|^2\,dx, \quad K \subset [0, 1]^d, \tag{14}$$

converges to zero. In particular:

Theorem 1 (Igel’nik and Pao [18]). For any compact $K$, $K \subset [0, 1]^d$,
$K \ne [0, 1]^d$ and any absolutely integrable function $g$ satisfying (10) there exists a
sequence of $f_{N,\omega,\Omega}(\cdot)$ and a sequence of probability measures $\{\mu_{N,\Omega,\alpha}\}$ such that

$$\lim_{N \to \infty} E_\omega(N) = 0.$$

Furthermore, under some additional restrictions on the functions $g$, $f$ the
expectation $E_\omega(N)$ is shown to obey (Theorem 3 in [18]):


C 




(15) 


Similar results have been reported in [31]. There the authors look at the
following classes of functions:

$$\mathcal{F}_p = \left\{ f(x) = \int_\Omega \alpha(w) \phi(x, w)\,dw \ \Big|\ |\alpha(w)| \le C p(w) \right\}, \quad \sup_{x,w} |\phi(x, w)| \le 1,$$

where $p$ is a distribution on $\Omega$, and

$$\mathcal{F}_w = \left\{ f(x) = \sum_{k=1}^{N} \alpha_k \phi(x, w_k) \ \Big|\ |\alpha_k| \le C/N \right\}.$$

For the chosen classes of functions they establish the following

Theorem 2 (cf. Lemma 1 in [31]). Let $f$ be a function from $\mathcal{F}_p$ and
$w_1, \dots, w_N$ be drawn iid from $p$. Then for any $1 > \delta > 0$ with probability at least
$1 - \delta$ over $w_1, \dots, w_N$ there exists a function $f_N$ from $\mathcal{F}_w$ so that

$$\|f - f_N\| \le \frac{C}{\sqrt{N}} \left( 1 + \sqrt{2 \log \frac{1}{\delta}} \right). \tag{16}$$

At first glance errors (5), (15), (16) and their respective convergence
rates look very similar. Yet, they are fundamentally different in that (5) is
deterministic and its convergence rate is crisp, whereas the other two estimates
have an additional element - a probability measure - characterizing their
convergence rate. Introduction of this measure enables one to “ignore” worst-case
elements corresponding to the rate of convergence $O(d^{-1}/N^{1/d})$. At the same time,
it imposes some limitations too since there is always a non-zero probability of an
unlucky draw from the probability distribution which will require re-initialization
at some stage of approximation. Furthermore, explicit presence of $1/\delta$ in (16)³
suggests that this rapid convergence of order $1/N^{1/2}$ is assured only up to a given
and fixed tolerance. These features of randomized approximators should thus
be considered with special care in applications.
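
To see how the confidence level enters, the following sketch evaluates the right-hand side of (16), assuming it has the form $(C/\sqrt{N})(1 + \sqrt{2\log(1/\delta)})$ as in Lemma 1 of [31], and inverts it for the number of terms needed to reach a prescribed tolerance. The function names and the numerical values are illustrative assumptions.

```python
import math


def random_feature_bound(c, n_terms, delta):
    """Right-hand side of (16): (C / sqrt(N)) * (1 + sqrt(2 log(1/delta)))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return c / math.sqrt(n_terms) * (1.0 + math.sqrt(2.0 * math.log(1.0 / delta)))


def terms_needed(c, delta, tol):
    """Smallest N for which the bound (16) does not exceed tol."""
    factor = c * (1.0 + math.sqrt(2.0 * math.log(1.0 / delta)))
    return math.ceil((factor / tol) ** 2)


if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        # A smaller failure probability delta requires more terms for the same tolerance.
        print(delta, terms_needed(c=1.0, delta=delta, tol=0.1))
```

The required $N$ grows only logarithmically in $1/\delta$, but the guarantee is always conditional on the chosen $\delta$.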

There is one additional point that is inherent to all randomized approxima¬ 
tors and yet it is rarely addressed in practice. This is a measure concentration 
effect that we will present and consider in detail in the next section. 


³ The very same term is implicit in the error rate derived for the Monte-Carlo inspired
approach in [18]. This follows immediately from the Chebyshev inequality. Indeed, if
$X_1, \dots, X_N$ are iid random variables with the same mean $\mu$ and variance $\sigma^2$ then the
probability that $|(X_1 + \cdots + X_N)/N - \mu| > \delta$ is assured to be at most $\sigma^2/(\delta^2 N)$.

4. Randomized Data Approximation and Measure Concentration

So far we considered the problem of data approximation and modelling from
the function approximation point of view. Let us, however, look at this problem
from a slightly different perspective (cf. EH). Suppose that the data points, $x$,
are labeled by real numbers $t_i$ so that each $i$-th data point is uniquely
determined by its label $t_i$. This means that the problem can be rephrased as that of
representing a combined vector $y = (f(x(t_1)), f(x(t_2)), \dots, f(x(t_n)))$ by linear
combinations of $h_i = (g(w_i^T x(t_1) + b_i), g(w_i^T x(t_2) + b_i), \dots, g(w_i^T x(t_n) + b_i))$
so that the error

$$\left\| y - \sum_{i=1}^{m} c_i h_i \right\|$$

is minimized. If our choice of vectors $h_1, \dots, h_m$ is good enough to represent
an arbitrary vector $y$ with acceptable precision then we can be assured that the
above approximation model is viable. The questions, however, are

1) if the choice of $h_i$ is randomized then how many vectors must one choose
to ensure the desired universality and accuracy?

2) how robust are such representations with respect to small perturbations of
data?

These questions are addressed in the next sections. For simplicity we ignore
that the vectors $h_i$ are in fact generated by functions $g(\cdot)$ and proceed assuming
that they are simply sampled randomly in a hypercube.
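
The vector formulation above can be sketched numerically. The minimal example below is restricted to scalar inputs $x(t_j)$ and $g = \tanh$ purely for brevity (both are our assumptions, as are all function names): it builds the vectors $h_i$ from randomly drawn $w_i$, $b_i$ and finds the coefficients $c_i$ minimizing the Euclidean error by solving the normal equations.

```python
import math
import random


def feature_vector(w, b, xs, g=math.tanh):
    """h = (g(w x(t_1) + b), ..., g(w x(t_n) + b)) for scalar inputs."""
    return [g(w * x + b) for x in xs]


def solve_linear(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_linear_weights(hs, y):
    """Coefficients c minimising ||y - sum_i c_i h_i|| via the normal equations."""
    gram = [[sum(p * q for p, q in zip(hi, hj)) for hj in hs] for hi in hs]
    rhs = [sum(p * q for p, q in zip(hi, y)) for hi in hs]
    return solve_linear(gram, rhs)


def residual_norm(hs, c, y):
    """Euclidean norm of y - sum_i c_i h_i."""
    approx = [sum(ci * hi[j] for ci, hi in zip(c, hs)) for j in range(len(y))]
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(approx, y)))


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [j / 49 for j in range(50)]
    y = [math.sin(2 * math.pi * x) for x in xs]
    # Nonlinear parameters are drawn at random; only c is fitted.
    hs = [feature_vector(rng.uniform(-5, 5), rng.uniform(-5, 5), xs) for _ in range(6)]
    c = fit_linear_weights(hs, y)
    print(residual_norm(hs, c, y))
```

Only the linear step is solved here; the quality of the fit depends entirely on how well the randomly drawn $h_i$ happen to span $y$, which is the question studied in the remainder of this section.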


4.1. Linear dependence and independence and related elementary properties

Let us first recall standard notions of linear dependence and independence.

Definition 1 (Linear Dependence). A system of vectors $h_1, h_2, \dots, h_m$,
$h_i \in \mathbb{R}^n$, $i = 1, \dots, m$ is said to be linearly dependent if there exist
$c_1, c_2, \dots, c_m$, $c_i \in \mathbb{R}$, $i = 1, \dots, m$ such that

$$c_1 h_1 + c_2 h_2 + \cdots + c_m h_m = 0$$

and at least one of $c_1, c_2, \dots, c_m$ is not equal to zero.

The same definition can be rewritten in the vector-matrix notation as

$$\exists\, c \in \mathbb{R}^m, \ c \ne 0 : \ Hc = 0, \tag{17}$$

where

$$H = (h_1 \ h_2 \ \cdots \ h_m)$$

is an $n \times m$ matrix formed by $h_1, \dots, h_m$.

Definition 2 (Linear Independence). A system of vectors $h_1, h_2, \dots, h_m$,
$h_i \in \mathbb{R}^n$, $i = 1, \dots, m$ is said to be linearly independent if it is not linearly
dependent. In other words,

$$Hc \ne 0 \quad \forall\, c \in \mathbb{R}^m : \ c \ne 0. \tag{18}$$

A very simple fact follows immediately from Definition 2.

Proposition 1. Consider a system of vectors $h_1, \dots, h_m$, and let
$H = (h_1 \ h_2 \ \cdots \ h_m)$. Let $|\cdot|_{\mathbb{R}^n}$ be a norm on $\mathbb{R}^n$, then the system
$h_1, \dots, h_m$ is linearly independent iff there exists an $\varepsilon > 0$ such that

$$|Hx|_{\mathbb{R}^n} > \varepsilon \quad \forall\, x \in S^{m-1}(1). \tag{19}$$

Proof. Suppose first that the system $h_1, \dots, h_m$ is linearly independent. Hence
(18) holds, and given that $x \ne 0$ for all $x \in S^{m-1}(1)$ we thus obtain that
$|Hx|_{\mathbb{R}^n} > 0$ for all $x \in S^{m-1}(1)$. Noticing that $x \mapsto |Hx|_{\mathbb{R}^n}$ is continuous
and $S^{m-1}(1)$ is compact we conclude that $|Hx|_{\mathbb{R}^n}$ takes minimal and maximal
values on $S^{m-1}(1)$. Let

$$\varepsilon_0 = \min_{x \in S^{m-1}(1)} |Hx|_{\mathbb{R}^n}.$$

The minimum of $|Hx|_{\mathbb{R}^n}$ on $S^{m-1}(1)$ is separated away from 0 since otherwise
there exists an $x \in S^{m-1}(1)$ such that $Hx = 0$. Thus (19) holds for any
$0 < \varepsilon < \varepsilon_0$.

Let us now show that (19) implies (18). Indeed, for any $c \in \mathbb{R}^m$, $c \ne 0$
there is an $x \in S^{m-1}(1)$ and an $\alpha \in \mathbb{R}$, $\alpha \ne 0$ such that $c = \alpha x$. Hence
$|Hc|_{\mathbb{R}^n} = |\alpha| |Hx|_{\mathbb{R}^n} > |\alpha| \varepsilon > 0$, which automatically assures that (18) holds.
□
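
For the Euclidean norm the minimum of $|Hx|$ over the unit sphere is the smallest singular value of $H$, so the $\varepsilon$ of Proposition 1 can be computed directly. The sketch below does this for the simplest case of two vectors ($m = 2$), where the smallest eigenvalue of the $2 \times 2$ Gram matrix $H^T H$ is available in closed form. The restriction to $m = 2$ and the function name are our simplifying assumptions.

```python
import math


def min_norm_on_circle(h1, h2):
    """min over unit x in R^2 of the Euclidean norm ||H x||, H = (h1 h2).

    Equals the smallest singular value of H, i.e. the square root of the
    smallest eigenvalue of the 2x2 Gram matrix H^T H.
    """
    a = sum(p * p for p in h1)
    b = sum(p * q for p, q in zip(h1, h2))
    d = sum(q * q for q in h2)
    trace, det = a + d, a * d - b * b
    disc = max(trace * trace - 4.0 * det, 0.0)
    lam_min = max((trace - math.sqrt(disc)) / 2.0, 0.0)
    return math.sqrt(lam_min)


if __name__ == "__main__":
    # Orthonormal pair: the minimum is bounded away from zero.
    print(min_norm_on_circle([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
    # Collinear pair: the minimum vanishes, so no eps > 0 satisfies (19).
    print(min_norm_on_circle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

A value that is positive but tiny signals a system that is formally independent yet numerically close to dependence, which motivates the quantification introduced next.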

4.2. Quantification of linear independence

Standard notions of linear dependence and independence are, unfortunately,
not always easy to assess numerically when the values of $\varepsilon$ in (19) are small.
Further, checking that for a given system of vectors $h_1, \dots, h_m$, some $\varepsilon$, and all
$x \in S^{m-1}(1)$ the following holds

$$|Hx|_{\mathbb{R}^n} > \varepsilon \tag{20}$$

may not always be feasible or desirable.

Two ways to relax and quantify the conventional notion of linear independence
are evident from Proposition 1. These are 1) the value of $\varepsilon$ in (19), and
2) a possibility of introducing a finite measure on $S^{m-1}(1)$ that determines a
proportion of $x$ from $S^{m-1}(1)$ which satisfy (20).

Definition 3 (Almost Linear Independence). Let $h_1, \dots, h_m$ be a system
of $m$ normalized vectors from $\mathbb{R}^n$: $|h_i|_{\mathbb{R}^n} = 1$, $i = 1, \dots, m$. We will say that the
system is $(\varepsilon, \delta)$-linearly independent (almost linearly independent) with respect
to $\mu$ if

$$\mu(\{x \in S^{m-1}(1) \mid |Hx|_{\mathbb{R}^n} > \varepsilon\}) \ge 1 - \delta. \tag{21}$$

Similarly, one can define an alternative quantification of linear dependence:

Definition 4 (Almost Linear Dependence). Let $h_1, \dots, h_m$ be a system of
$m$ normalized vectors from $\mathbb{R}^n$: $|h_i|_{\mathbb{R}^n} = 1$, $i = 1, \dots, m$. We will say that the
system is $(\varepsilon, \delta)$-linearly dependent (almost linearly dependent) with respect to $\mu$
if

$$\mu(\{x \in S^{m-1}(1) \mid |Hx|_{\mathbb{R}^n} < \varepsilon\}) \ge 1 - \delta.$$
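
The measure in Definitions 3 and 4 can be estimated by sampling. The sketch below draws points uniformly on $S^{m-1}(1)$ (normalized Gaussian vectors) and reports the share for which the Euclidean norm of $Hx$ exceeds $\varepsilon$. The choice of the Euclidean norm, the sample size, and the function names are illustrative assumptions.

```python
import math
import random


def random_unit_vector(m, rng):
    """Uniform point on S^{m-1}(1): a normalised standard Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(t * t for t in v))
        if norm > 0.0:
            return [t / norm for t in v]


def independence_measure(hs, eps, samples=2000, seed=0):
    """Monte Carlo estimate of mu({x in S^{m-1}(1) : ||Hx|| > eps})."""
    rng = random.Random(seed)
    n = len(hs[0])
    hits = 0
    for _ in range(samples):
        x = random_unit_vector(len(hs), rng)
        # Hx = sum_i x_i h_i, computed coordinate by coordinate.
        hx = [sum(xi * h[j] for xi, h in zip(x, hs)) for j in range(n)]
        if math.sqrt(sum(t * t for t in hx)) > eps:
            hits += 1
    return hits / samples


if __name__ == "__main__":
    orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    repeated = [[1.0, 0.0], [1.0, 0.0]]
    print(independence_measure(orthonormal, eps=0.5))
    print(independence_measure(repeated, eps=0.1))
```

In the second case the two vectors are linearly dependent in the classical sense, yet the estimated measure is close to one: the set where $Hx$ is small is a thin neighbourhood of a single direction.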

Notice that the notions of almost linear independence and almost linear
dependence introduced in Definitions 3, 4 are consistent with conventional notions
in the sense that the latter can be viewed as limiting cases of the former. Indeed,
if $\mu$ is a surface area then setting $\delta = 0$ in Definition 3 and picking $\varepsilon$ small
enough one obtains an equivalent of Definition 2.

The above measure, or “probabilistic”, quantification of linear independence
has significant implications for data representation in applications. As we shall
see in the next sections two seemingly exclusive extremes are likely to hold in
higher dimensions. First, almost all points of an $n$-ball concentrate in an
$\varepsilon$-thickening of an $(n-1)$-disc. This means that for $m$ sufficiently large a family
of randomly and independently chosen vectors $h_1, h_2, \dots, h_m$ becomes almost
linearly dependent. This phenomenon is well-known in the literature (see e.g.
[12], [13], [11] as well as classical works of Maxwell, Levy, and Gibbs; in Data
Mining applications measure concentration effects have been discussed in [28]).
Yet, the values of $m$ for which almost linear independence persists may be
exponentially large. Furthermore, the latter situation holds with a probability
that is close to one. This second extreme can be rephrased as that the number
of almost orthogonal vectors grows exponentially with dimension. More formal
statements and justifications are provided in Propositions 2, 3 below.


4.3. Measure concentration

We begin with the following statement

Proposition 2. Let $B_n(R)$ be an $n$-ball of radius $R$ in $\mathbb{R}^n$ and $0 < \delta < R$.
Then the following estimate holds:

$$\frac{V(B_n(R)) - V(B_n(R - \delta))}{V(B_n(R))} > 1 - e^{-\frac{n\delta}{R}}. \tag{22}$$


Proof. Noticing that $V(B_n(R)) = C_n R^n$, where $C_n$ is a constant independent
of $R$, we conclude that

$$\frac{V(B_n(R)) - V(B_n(R - \delta))}{V(B_n(R))} = 1 - \frac{(R - \delta)^n}{R^n} = 1 - \left( 1 - \frac{\delta}{R} \right)^n.$$

Next we recall that the following elementary inequality holds for all $0 < x < 1$:

$$(1 - x) \frac{1}{e} < (1 - x)^{\frac{1}{x}} < \frac{1}{e}. \tag{23}$$

Indeed, with respect to the right part of (23), $(1 - x)^{1/x} < e^{-1}$, we notice that
$(1 - x)^{1/x} < e^{-1} \Leftrightarrow 1 - x < e^{-x}$. The function $y = e^{-x}$ is strictly convex on
$\mathbb{R}$, and $y = 1 - x$ is its first-order Taylor approximation at $x = 0$. Thus that
$(1 - x)^{1/x} < 1/e$ holds true for $0 < x < 1$ is the consequence of the strict convexity
of the exponential $e^{-x}$.

In order to see that the left part of (23) holds true too, consider the following
chain of equivalent inequalities for $0 < x < 1$:

$$(1 - x) e^{-1} < (1 - x)^{\frac{1}{x}} \Leftrightarrow e^{-x} < (1 - x)^{1 - x} \Leftrightarrow -x < (1 - x) \ln(1 - x).$$

Again, $y = (1 - x) \ln(1 - x)$ is strictly convex on $(-\infty, 1)$, and $y = -x$ is
its first-order Taylor approximation at $x = 0$. Hence the left part of the inequality,
$(1 - x) e^{-1} < (1 - x)^{1/x}$, is the consequence of strict convexity of $(1 - x) \ln(1 - x)$.
Using (23) with $x = \delta/R$ one can derive that

$$\left( 1 - \frac{\delta}{R} \right)^{\frac{n\delta}{R}} e^{-\frac{n\delta}{R}} < \left( 1 - \frac{\delta}{R} \right)^{n} < e^{-\frac{n\delta}{R}}.$$

Moreover, for $\delta/R$ sufficiently small we obtain

$$\left( 1 - \frac{\delta}{R} \right)^{n} \approx e^{-\frac{n\delta}{R}}$$

with accuracy estimate following from (23). □

In accordance with Proposition 2 the volume of an $n$-ball of radius $R$, for $n$
sufficiently large, is concentrated in a thin layer around its surface. Furthermore,
in this thin layer the volume of an $n$-ball is concentrated around the largest
equator of the corresponding $(n-1)$-sphere, $S^{n-1}(R)$.
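
The shell concentration of Proposition 2 is easy to tabulate. The sketch below compares the exact fraction $1 - (1 - \delta/R)^n$ of the volume lying within distance $\delta$ of the surface with the lower bound $1 - e^{-n\delta/R}$, and checks the elementary inequality (23) at a given point. The function names and sample values are illustrative assumptions.

```python
import math


def shell_fraction(n, delta, radius=1.0):
    """Exact fraction of the volume of B_n(R) within distance delta of its surface."""
    return 1.0 - (1.0 - delta / radius) ** n


def shell_lower_bound(n, delta, radius=1.0):
    """Lower bound 1 - exp(-n delta / R) on the same fraction."""
    return 1.0 - math.exp(-n * delta / radius)


def inequality_23_holds(x):
    """Check (1 - x)/e < (1 - x)^(1/x) < 1/e for a given 0 < x < 1."""
    middle = (1.0 - x) ** (1.0 / x)
    return (1.0 - x) / math.e < middle < 1.0 / math.e


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # With delta = 5% of the radius the shell captures most of the volume once n is large.
        print(n, shell_fraction(n, 0.05), shell_lower_bound(n, 0.05))
```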


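Proposition 3. Let $B_n(R)$ be an $n$-ball of radius $R$ in $\mathbb{R}^n$ and $0 < \delta < R$. Let
$\mathcal{D}^{n-1}_\delta(R)$ be a $\delta$-thickening of an $(n-1)$-disc $\mathcal{D}^{n-1}(R)$. Then

$$\frac{V(B_n(R)) - V(\mathcal{D}^{n-1}_\delta(R))}{V(B_n(R))} < e^{-\frac{n\delta^2}{2R^2}}.$$

Proof. Consider

$$\frac{V(B_n(R)) - V(B_n(\sqrt{R^2 - \delta^2}))}{V(B_n(R))} = 1 - \left( 1 - \frac{\delta^2}{R^2} \right)^{\frac{n}{2}}.$$

Using (23) we obtain

$$1 - \left( 1 - \frac{\delta^2}{R^2} \right)^{\frac{n}{2}} > 1 - e^{-\frac{n\delta^2}{2R^2}},$$

and the result follows automatically from

$$V(B_n(R)) - V(\mathcal{D}^{n-1}_\delta(R)) < V(B_n(\sqrt{R^2 - \delta^2})).$$

□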
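
The waist concentration can also be observed by direct sampling. The sketch below draws points uniformly from the unit ball (a Gaussian direction scaled by $U^{1/n}$), counts the share falling in the slab $|x_1| \le \delta$, and compares it with $1 - e^{-n\delta^2/2}$, the value that the proof above gives for $R = 1$. The sampling scheme, sample size, and function names are our illustrative assumptions.

```python
import math
import random


def uniform_in_ball(n, rng):
    """Uniform point in the unit ball B_n(1): Gaussian direction times U^(1/n)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in g))
    scale = rng.random() ** (1.0 / n) / norm
    return [t * scale for t in g]


def slab_fraction(n, delta, samples=5000, seed=0):
    """Monte Carlo share of the ball lying in the delta-thickened equatorial disc |x_1| <= delta."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples) if abs(uniform_in_ball(n, rng)[0]) <= delta)
    return hits / samples


def slab_lower_bound(n, delta):
    """Lower bound 1 - exp(-n delta^2 / 2) for R = 1."""
    return 1.0 - math.exp(-n * delta * delta / 2.0)


if __name__ == "__main__":
    for n in (5, 50, 200):
        print(n, slab_fraction(n, 0.3), slab_lower_bound(n, 0.3))
```

For small $n$ the bound is weak and the slab holds only part of the volume; as $n$ grows both the estimate and the bound approach one.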

In high dimension the volume of the ball is concentrated in a thin layer near
the sphere. Therefore, the estimate of the volume of the disk automatically
provides an estimate of the surface of the corresponding waist of the sphere.
Let us produce this estimate explicitly. The proportion of $S^{n-1}(1)$ belonging to
the cap (shaded part of the sphere in Fig. 1) equals the proportion of the solid
ball that lies in the corresponding spherical cone (cf. [3], Fig. 11). The latter
consists of two parts: one is the cone of height $\delta$ and radius of the base $\sqrt{1 - \delta^2}$.
The volume of the second part can be bounded from above by half of the
volume of the ball with radius $\sqrt{1 - \delta^2}$. If we use the Stirling formula for the
volume of a high-dimensional ball, $V_n(R) \sim \frac{1}{\sqrt{n\pi}} \left( \frac{2\pi e}{n} \right)^{n/2} R^n$, then we obtain that
the fraction of the waist of the width $2\delta$ is $1 - (1 + O(\delta/\sqrt{n}))\, e^{-\frac{n\delta^2}{2}}$. Indeed

$$V_n(R) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} R^n, \quad V_n(R) = 2\pi \left( \frac{R}{\sqrt{n}} \right)^2 V_{n-2}(R),$$

$$V_n(R) = R \sqrt{\pi}\, \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2} + 1\right)} V_{n-1}(R).$$

The estimate now follows from

$$\Gamma(x) = x^{x - \frac{1}{2}} e^{-x} \sqrt{2\pi} \left( 1 + \frac{1}{12x} + R_2(x) \right), \tag{24}$$


where the remainder $R_2(x)$ can be bounded as

$$|R_2(x)| \le \frac{1 + \zeta(2)}{(2\pi)^3 x^2} \tag{25}$$

(see (3.11) from [bj for details). This estimate improves the textbook estimate
$1 - 2e^{-\frac{n\delta^2}{2}}$ for large $n$ jd .

The concentration effect described in Proposition 3 implies that in high
dimensions nearly all independently and randomly drawn vectors will belong to
a $\delta$-thickening of a set of vectors that are linearly dependent: the $(n-1)$-disc
$\mathcal{D}^{n-1}(R)$. On the other hand a set of $n - 1$ vectors with probability close to one
spans almost all vectors. We note that the latter property holds for systems of
$n - k$, $k > 1$ vectors too, which follows immediately from [2]:


Theorem 3 (Theorem 3.1 in [2]). Let $E_k$ be a $k$-dimensional subspace of $\mathbb{R}^n$
and denote by $\mu((E_k)_\varepsilon)$ the Haar measure on the sphere $S^{n-1}(1)$ of the set
of points within a geodesic distance smaller than $\varepsilon$ of $E_k$. We write $k = \lambda n$.
Fix $0 < \varepsilon < \pi/2$ and $0 < \lambda < 1$. The following estimates hold as $n \to \infty$

(i) If $\sin^2 \varepsilon > 1 - \lambda$, then


- 1 


1 \/A(l — A) 

y/nTT sin 2 £ — (1 — A) 


(ii) If $\sin^2 \varepsilon < 1 - \lambda$, then




1 \J A(1 - A) 

y/rvt r (1 - A) - sin 2 £ 


where u{ A, e) = (1 - A) log 4^ + A log . 
Figure 1: Illustration to the estimate of the $2\delta$-width of the waist of the sphere. For (b),
$n \gg 1$. More accurate estimate with evaluation of the remainder can be extracted from (24),
(25).


Since nearly all vectors in $B_n(1)$ are concentrated in an $\varepsilon$-disc it is interesting
to know how many pairwise almost orthogonal vectors can be found in this set.
It turns out that this number is exponentially large. Detailed analysis and
derivations are provided in the next section.

4.4. Almost orthogonality in high dimensions

Select two small positive numbers, $\varepsilon$ and $\vartheta$. Let us generate randomly and
independently $N$ vectors $x_1, \dots, x_N$ on $S^{n-1}(1)$. We are interested in the
probability $P$ that all $N$ random vectors are pairwise $\varepsilon$-orthogonal, i.e.
$|\langle x_i, x_j \rangle| < \varepsilon$ for $i, j = 1, \dots, N$, $i \ne j$. For which $N$ is this $P > 1 - \vartheta$?

Propositions 2, 3 and subsequent discussion suggest that for $n \gg 1$ almost
all volume of an $n$-ball $B_n(1)$ is concentrated in an $\varepsilon$-thickening of its largest
equator. Moreover for an arbitrarily chosen point $p$ on the surface of $B_n(1)$
almost all points of the ball belong to the set $\mathcal{DC}^{n-1}_\varepsilon(1)$ comprising the
difference of $\mathcal{D}^{n-1}_\varepsilon(1)$ with the corresponding spherical cones (see Fig. 1). For the
given choice of $p$ and $\mathcal{DC}^{n-1}_\varepsilon(1)$ the following estimate holds

$$|\cos(\angle(p, x))| \le \varepsilon \quad \forall\, x \in \mathcal{DC}^{n-1}_\varepsilon(1).$$

The latter property means that the vector $p$ is almost orthogonal to nearly all
remaining points in $B_n(1)$.

Another way of looking at this could be as follows. In accordance with
Proposition 2 almost all points of the ball $B_n(1)$ are concentrated in an
$\varepsilon$-thickening of its surface. At the same time they are also concentrated in
$\mathcal{D}^{n-1}_\varepsilon(1)$. Lengths of such points, $x$, satisfy $1 - \varepsilon < \|x\| \le 1$, and hence

$$|\cos(\angle(p, x))| \le \frac{|p^T x|}{\|x\|} < \frac{\varepsilon}{1 - \varepsilon}.$$
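
A short simulation makes this near-orthogonality visible. The sketch below draws a set of random unit vectors in $\mathbb{R}^n$ and reports the largest absolute cosine over all pairs; in low dimension some pair is always far from orthogonal, whereas in high dimension every pair is nearly orthogonal. Sampling on the sphere via normalized Gaussian vectors, the set sizes, and the function names are illustrative assumptions.

```python
import math
import random


def random_unit_vector(n, rng):
    """Uniform point on S^{n-1}(1): a normalised standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(t * t for t in g))
    return [t / norm for t in g]


def max_abs_cosine(n, count, seed=0):
    """Largest |<x_i, x_j>| over all pairs of `count` random unit vectors in R^n."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(n, rng) for _ in range(count)]
    worst = 0.0
    for i in range(count):
        for j in range(i + 1, count):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


if __name__ == "__main__":
    for n in (3, 30, 300, 1000):
        print(n, max_abs_cosine(n, count=30))
```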

Let us determine the number of independent random vectors which are
pairwise $\varepsilon$-orthogonal with probability $1 - \vartheta$. The volume taken by all vectors that
are almost orthogonal to a given vector on a unit sphere can be estimated from
Proposition 3 (up to a small correction term $O(\delta e^{-\frac{n\delta^2}{2}}/\sqrt{n})$ with $\delta = \varepsilon$).
Consider the following products

$$P(\varepsilon, N) = \prod_{k=1}^{N} \left( 1 - k e^{-\frac{n\varepsilon^2}{2}} \right). \tag{26}$$

The value of $P(\varepsilon, N)$ is an estimate from below of the probability of a set of
$N + 1$ independent random vectors to be pairwise $\varepsilon$-orthogonal. Indeed, for one
vector $h_1$ the fraction of the vectors which are not $\varepsilon$-orthogonal to $h_1$ is evaluated
as $e^{-\frac{n\varepsilon^2}{2}}$. Therefore, for $k$ vectors $h_1, \dots, h_k$, the fraction of the vectors which
are not $\varepsilon$-orthogonal to $h_1, \dots, h_k$ is at most $k e^{-\frac{n\varepsilon^2}{2}}$. The probability to select
randomly a vector $h_{k+1}$, which is $\varepsilon$-orthogonal to $h_1, \dots, h_k$, is higher than
$1 - k e^{-\frac{n\varepsilon^2}{2}}$. The vectors are selected independently, therefore we have the
estimate (26).
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The value of P(e, N) in (12611 can be estimated as follows. For Ne ” 2 < 1: 

P(e,N) > (l-Ne—*~) ~ e~ N e 2 . (27) 

According to (27), if $P(\varepsilon, N)$ is required to exceed a certain value,
$1 - \vartheta$, the number of pairwise almost orthogonal vectors in $B^n(1)$
has the following asymptotic estimate: for

$$N \le e^{\varepsilon^2 n/4} \left[\log\left(\frac{1}{1 - \vartheta}\right)\right]^{1/2} \qquad (28)$$

all random vectors $h_1, \dots, h_{N+1}$ are pairwise $\varepsilon$-orthogonal
with probability $P > 1 - \vartheta$.
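
As a numerical sanity check of (26) and (28), the following minimal Python
sketch evaluates the product $P(\varepsilon, N)$ directly and compares it with
the target level $1 - \vartheta$ at the chain length allowed by (28). The
function names and the sample values $\varepsilon = 0.1$, $n = 2000$,
$\vartheta = 0.1$ are illustrative assumptions and are not taken from the
experiments reported in this paper.

```python
import math


def p_lower_bound(eps: float, n: int, N: int) -> float:
    """Product (26): prod_{k=1}^{N} (1 - k * exp(-eps^2 * n / 2))."""
    r = math.exp(-eps * eps * n / 2.0)
    p = 1.0
    for k in range(1, N + 1):
        factor = 1.0 - k * r
        if factor <= 0.0:
            # The bound carries no information once k * r reaches 1.
            return 0.0
        p *= factor
    return p


def chain_length_bound(eps: float, n: int, theta: float) -> float:
    """Right-hand side of (28): exp(eps^2 n / 4) * sqrt(log(1 / (1 - theta)))."""
    return math.exp(eps * eps * n / 4.0) * math.sqrt(math.log(1.0 / (1.0 - theta)))


if __name__ == "__main__":
    eps, n, theta = 0.1, 2000, 0.1       # hypothetical sample values
    N = int(chain_length_bound(eps, n, theta))
    print("N allowed by (28):", N)
    print("P(eps, N) from (26):", p_lower_bound(eps, n, N))
    print("target level 1 - theta:", 1.0 - theta)
```

For these sample values the product (26) evaluated at the integer part of the
bound (28) stays above $1 - \vartheta$, which is what the derivation predicts.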

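Estimate (27) of (26) can be refined if we apply log to the right-hand side
of (26),

$$\log P(\varepsilon, N) = \sum_{k=1}^{N} \log\left(1 - k e^{-\varepsilon^2 n/2}\right),$$

and estimate the above sum using the integral

$$J(z) = \int_0^z \log(1 - x r)\, dx = \frac{rz - 1}{r} \log(1 - rz) - z, \quad r = e^{-\varepsilon^2 n/2}.$$

Since $\log P(\varepsilon, N)$ is monotone for $e^{\varepsilon^2 n/2} > N \ge 1$
we can conclude that $J(N+1) \le \log P(\varepsilon, N) \le J(N)$. Furthermore,
given that

$$\log(1 - x) \le -\frac{2x}{2 - x} \quad \text{for all } x \in [0, 1],$$

the following holds

$$J(z) \ge \frac{2z(rz - 1)}{rz - 2} - z$$

for all $zr \in [0, 1]$. Hence

$$\frac{2(N + 1)(r(N + 1) - 1)}{r(N + 1) - 2} - (N + 1) \le \log P(\varepsilon, N),$$

and $P(\varepsilon, N) \ge 1 - \vartheta$ is guaranteed whenever the left-hand
side is at least $\log(1 - \vartheta)$. Transforming this condition into the
quadratic inequality

$$r(N + 1)^2 - \log(1 - \vartheta)\, r(N + 1) + 2 \log(1 - \vartheta) \le 0$$

and solving for $N$ results in

$$N \le \sqrt{\frac{\log^2(1 - \vartheta)}{4} + 2 \log\left(\frac{1}{1 - \vartheta}\right) e^{\varepsilon^2 n/2}} + \frac{\log(1 - \vartheta)}{2} - 1. \qquad (29)$$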

Notice that the refined estimate (29) has asymptotic exponential rate of order
$e^{\varepsilon^2 n/4}$, which is identical to the one derived in (28).

5. Discussion


Estimates (28), (29) derived in the previous section suggest that, for
$\vartheta$ sufficiently small, a set of $N$ randomly and independently chosen
vectors in $B^n(1)$ will be pairwise $\varepsilon$-orthogonal with probability
$1 - \vartheta$ for

$$N \le e^{\varepsilon^2 n/4}\, \vartheta^{1/2}.$$

Such an exponential bound enables us to explain some apparent controversy [39]
regarding convergence rates of approximation schemes based on iterative greedy
approximation [?], [?] and randomized choice of functions advocated in [18],
[31].

Both greedy approximation and systems of randomized basis functions enjoy
the convergence rate of order $1/N^{1/2}$ in $L_2$-norm irrespective of
dimensionality of the domain on which the function being approximated is
defined. Greedy approximation, however, requires solving nonlinear optimization
problems at each step. In randomized strategies, parameters of the basis
functions/kernels are randomly drawn from a given set. This transforms the
original problem into a much simpler linear one.

Notice, however, that the constant $C$ in the error rate $O(C/N^{1/2})$ can
become rather large in these schemes. If, for example, kernels
$\phi(x, \omega)$ in Theorem ?? approach Dirac delta-functions, then the
constant $C$ in (??) becomes arbitrarily large. The same problem may occur for
the error rate (??). This means that practical application of randomized
approximation schemes is to be preceded by additional analysis of the class of
functions being approximated. Second, the achievable rate in these schemes is in
general dependent on the values of $\delta$, determining probability of success
(cf. (16); see our comment on Monte-Carlo convergence in probability too). The
smaller the value of $\delta$ the higher is the probability of achieving the
rate $O(C/N^{1/2})$. That rate, however, is proportional to a strictly monotone
function of $1/\delta$. This might be acceptable in many practical applications.
Nevertheless, if particularly high accuracy is desirable one needs to take this
into account.

An insight into the slowing-down of convergence of randomized approximators
due to either increased $C$ or the need to keep $\delta$ small can be gained
by considering measure concentration effects discussed in Propositions ??,
Theorem 3 and in Section 4.4. Indeed, in accordance with the analysis presented,
a system of $n - k$, $k \ge 1$, vectors spans almost all vectors in a ball
$B^n(1)$. This system, however, is almost linearly dependent. This means that
the coefficients in corresponding linear combinations could become arbitrarily
large. Large coefficients, in turn, make the representation sensitive to small
inaccuracies, which is generally undesirable. This observation of typicality of
large coefficients in randomized choice of relatively short bases is consistent
with our earlier remarks about growth of $C$ and influence of $1/\delta$. On the
other hand, if a representation of data is needed in which all coefficients are
to be within certain bounds then the number of randomized basis vectors may
become exponentially large.

To overcome potential danger of exponential growth of the number of elements
needed to achieve acceptable level of performance of a randomized approximator
one may constrain dimensionality of randomization by "binning" or partitioning
the space of the original problem into a set of spaces of smaller dimensions. An
example of such an optimization would be to use e.g. multi-scale basis
functions, followed by randomization at each scale. Cascading, multi-grid,
frequency separation, and localization could all be used for the same purposes.
The achieved efficiency will be determined by suitability of each of these
"binning" approaches for the problem at hand. Such a binning will violate
isotropy of the space. Violation of isotropy can make random approximation much
more efficient. For example, [?] proved that if the vectors of signals are
sparse in a fixed basis, then it is possible to reconstruct these signals with
very high accuracy from a small number of random measurements.

6. Examples 

6.1. Error convergence rate. Deterministic vs Randomized approaches 

In order to illustrate the main difference between greedy and RVFL
approximators, we consider the following example in which a simple function is
approximated by both methods, greedy approximation (??), (??) and approximation
based on randomized choice of bases. Let $f(x)$ be defined as follows:

$$f(x) = 0.2 e^{-(10x - 4)^2} + 0.5 e^{-(80x - 40)^2} + 0.3 e^{-(80x - 20)^2}.$$

The function $f(x)$ is shown in Fig. 2, top panel. Clearly, $f(\cdot)$ belongs
to the convex hull of $G$, and hence to its closure.

First, we implemented greedy approximation (??), (??) in which we searched
for $g_n$ in the following set of functions

$$G = \{ e^{-(w^T x + b)^2} \},$$

where $w \in [0, 200]$, $b \in [-100, 0]$. The procedure for constructing $f_n$
was as follows. Assuming $f_0(x) = 0$, $e_0 = -f$ we started with searching for
$w_1$, $b_1$ such that

$$\langle 0 - f(x),\, g(w_1 x + b_1) - f(x) \rangle = -\langle f(x),\, g(w_1 x + b_1) \rangle + \|f(x)\|^2 < \varepsilon, \qquad (30)$$

where $\varepsilon$ was set to be small ($\varepsilon = 10^{-6}$ in our case).
When searching for a solution of (30) (which exists because the function $f$ is
in the convex hull of $G$ [3]), we did not utilize any specific optimization
routine. We sampled the space of parameters $w_i$, $b_i$ randomly and picked the
first values of $w_i$, $b_i$ which satisfy (30). Integral (30) was evaluated in
quadratures over a uniform grid of 1000 points in $[0, 1]$.

The values of $\alpha_1$ and the function $f_1$ were chosen in accordance with
(??) with $M'' = 2$, $M' = 1.5$ (these values are chosen to assure
$M'' > M' > \sup_{g \in G} \|g\| + \|f\|$). The iteration was repeated,
resulting in the following sequence of functions

$$f_N(x) = \sum_{i=1}^{N} c_i\, g(w_i^T x + b_i), \qquad c_i = \alpha_i (1 - \alpha_{i+1})(1 - \alpha_{i+2}) \cdots (1 - \alpha_N).$$

Evolution of the normalized approximation error

$$\bar{e}_N = \frac{\|e_N\|^2}{\|f\|^2} = \frac{\|f_N - f\|^2}{\|f\|^2} \qquad (31)$$

for 100 trials is shown in Fig. 2 (middle panel). Each trial consisted of 100
iterations, thus leading to networks of 100 elements at the 100th step. We
observe that the values of $\bar{e}_N$ monotonically decrease as $O(1/N)$, with
the behavior of this approximation procedure consistent across trials.
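
The random search for the first greedy element described above can be sketched
as follows. This is a minimal illustration and not the code used for Fig. 2: it
covers only the search for $w_1$, $b_1$ satisfying (30) and omits the update of
$\alpha_n$, which depends on formulas given elsewhere in the paper. The target
function is written as we read it from the text, the trapezoidal rule is an
assumed choice of quadrature, and all identifiers are hypothetical.

```python
import math
import random

# Uniform grid of 1000 points in [0, 1], as in the text.
GRID = [i / 999.0 for i in range(1000)]


def g(w: float, b: float, x: float) -> float:
    """Element of G: exp(-(w*x + b)^2)."""
    return math.exp(-((w * x + b) ** 2))


def f(x: float) -> float:
    """Target function (assumed reading of the garbled formula)."""
    return 0.2 * g(10, -4, x) + 0.5 * g(80, -40, x) + 0.3 * g(80, -20, x)


def inner(u, v) -> float:
    """Trapezoidal approximation of the L2 inner product on [0, 1]."""
    h = GRID[1] - GRID[0]
    vals = [a * b for a, b in zip(u, v)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


F = [f(x) for x in GRID]
F_NORM2 = inner(F, F)


def first_greedy_step(rng: random.Random, eps: float = 1e-6, max_tries: int = 5000):
    """Sample (w, b) uniformly and return the first pair satisfying (30)."""
    for _ in range(max_tries):
        w = rng.uniform(0.0, 200.0)
        b = rng.uniform(-100.0, 0.0)
        g1 = [g(w, b, x) for x in GRID]
        if -inner(F, g1) + F_NORM2 < eps:
            return w, b
    return None  # no admissible element found within max_tries


if __name__ == "__main__":
    print(first_greedy_step(random.Random(0)))
```

Because $f$ is a convex combination of elements of $G$, at least one of its own
components satisfies (30) with a negative left-hand side, so the admissible set
of parameters has positive measure and the random search terminates quickly in
practice.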

Second, we implemented an approximator based on Monte-Carlo integration. At the
$N$th step of the approximation procedure we pick randomly an element from $G$,
where $w \in [0, 200]$, $b \in [-200, 200]$ (uniform distribution). After an
element is selected, we add it to the current pool of basis functions

$$\{ g(w_1 x + b_1), \dots, g(w_{N-1} x + b_{N-1}) \}.$$

Then the weights $c_i$ in the superposition

$$f_N = \sum_{i=1}^{N} c_i\, g(w_i^T x + b_i)$$

are optimized so that $\|f_N - f\| \to \min$. Evolution of the normalized
approximation error $\bar{e}_N$ (31) over 100 trials is shown in Fig. 2 (panel
(c)). As can be observed from the figure, even though the values of $\bar{e}_N$
form a monotonically decreasing sequence, they are far from $1/N$, at least for
$1 \le N \le 100$. Behavior across trials is not consistent, at least for the
networks smaller than 100 elements, as indicated by a significant spread among
the curves.
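
A minimal sketch of the randomized procedure is given below. It is not the
implementation behind Fig. 2. Instead of calling a linear solver, it obtains the
least-squares error on a grid by projecting $f$ onto the span of the sampled
elements with modified Gram-Schmidt, and it skips elements that are numerically
dependent on those already selected; both choices, the coarser 200-point grid,
and all identifiers are assumptions made for the illustration.

```python
import math
import random

M = 200  # grid size (assumption: coarser than the 1000 points used in the text)
GRID = [i / (M - 1) for i in range(M)]


def g(w: float, b: float, x: float) -> float:
    return math.exp(-((w * x + b) ** 2))


def f(x: float) -> float:
    """Target function (assumed reading of the garbled formula)."""
    return 0.2 * g(10, -4, x) + 0.5 * g(80, -40, x) + 0.3 * g(80, -20, x)


def dot(u, v) -> float:
    """Discrete inner product on the grid."""
    return sum(a * b for a, b in zip(u, v)) / M


def random_basis_errors(n_elements: int, rng: random.Random, tol: float = 1e-8):
    """Normalized squared error after each randomly drawn element is added."""
    target = [f(x) for x in GRID]
    norm2 = dot(target, target)
    residual = target[:]
    ortho = []    # orthonormalized versions of the accepted elements
    errors = []
    for _ in range(n_elements):
        w = rng.uniform(0.0, 200.0)
        b = rng.uniform(-200.0, 200.0)
        v = [g(w, b, x) for x in GRID]
        for q in ortho:                      # modified Gram-Schmidt sweep
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        nv = math.sqrt(dot(v, v))
        if nv > tol:                         # skip numerically dependent elements
            q = [vi / nv for vi in v]
            ortho.append(q)
            c = dot(residual, q)
            residual = [ri - c * qi for ri, qi in zip(residual, q)]
        errors.append(dot(residual, residual) / norm2)
    return errors


if __name__ == "__main__":
    errs = random_basis_errors(60, random.Random(0))
    print(errs[0], errs[-1])
```

The residual of the projection equals the residual of the optimal linear
weights on the grid, so the returned sequence is non-increasing by construction.
Many sampled elements are numerically zero on $[0, 1]$ (for example when
$b > 0$ is large) and leave the error unchanged, which is one source of the
slow decay discussed above.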

Overall comparison of these two methods is provided in Fig. 3, in which
the errors $\bar{e}_N$ are presented in the form of a box plot. Black solid
curves depict the median of the error as a function of the number of elements,
$N$, in the network; blue boxes contain 50% of the data points in all trials;
"whiskers" delimit the areas containing 75% of data, and red crosses show the
remaining part of the data. As we can see from these plots, random basis
function approximators, such as the RVFL networks, mostly do not match
performance of greedy approximators for networks of reasonable size. Perhaps
employing integration methods with variance minimization could improve the
performance. This, however, would amount to using prior knowledge about the
target function $f$, making it difficult to apply the RVFL networks to problems
in which the function $f$ is uncertain.

Figure 2: Practical speed of convergence of function approximators that use
greedy algorithm (panel (b)) and Monte-Carlo based random choice of basis
functions (panel (c)). The target function is shown in panel (a).

Figure 3: Box plots of convergence rates for function approximators that use
greedy algorithm (panel (a)) and Monte-Carlo random choice of basis functions
(panels (b) and (c)). Panel (b) corresponds to the case in which the basis
functions leading to ill-conditioning were discarded. Panel (c) shows
performance of the MLP trained by the method in [?] which is effective at
counteracting ill-conditioning while adjusting the linear weights only. The red
curve shows the upper bound for $\bar{e}_N$ calculated in accordance with (??).
We duplicated the average performance of the greedy algorithm (grey solid curve)
in the middle and bottom panels for convenience of comparison.

Now we demonstrate performance of an MLP trained to approximate this
target function. The NN is trained by a gradient based method described in
[?]. At first, the full network training is carried out for several network
sizes $N = 20, 40, 60, 80$ and $100$ and input samples randomly drawn from
$x \in [0, 1]$. The values of $\bar{e}_N$ are $1.5 \cdot 10^{-4}$ for all the
network sizes (as confirmed in many training trials repeated to assess
sensitivity to weight initialization). This suggests that training and
performance of much smaller networks should be examined. The networks with
$N = 2, 4, 6, 8, 10$ are trained, resulting in
$\bar{e}_N = 0.5749, 0.1416, 0.0193, 0.0011, 0.0004$, respectively, averaged
over 100 trials per network size. Next, we train only the linear weights ($c_i$
in (??)) of the MLP, fixing the nonlinear weights $w_i$ and $b_i$ to random
values. The results for $\bar{e}_N$ averaged over 100 trials are shown in
Fig. 3, bottom panel (black curve). Remarkably, the results of the random basis
network with $N = 100$ are worse than those of the MLP with $N > 4$ and full
network training. These results indicate that both the greedy and the
Monte-Carlo approximation results shown in Fig. 3 are quite conservative.
Furthermore, the best of those two, i.e., the greedy approximation's, can be
dramatically improved by a practical gradient based training.

6.2. Measure concentration effects 

Measure concentration effects, as presented in Section 4, have been discussed
for idealized objects such as $S^{n-1}(R)$ and $B^n(R)$. The phenomenon,
however, broadly applies to other objects whose geometric and formal description
is not limited to the former.

In order to illustrate this point we analysed a database of HOG feature
vectors [?] containing representations of images of faces (see footnote 4) as
well as the negatives (non-faces). Each feature vector has 1920 components, and
hence belongs to $\mathbb{R}^n$ with $n = 1920$. Vectors of each class have been
centered and normalized so that they belong to the hypercube $[-1, 1]^n$. Fig. 4
shows distributions of angles between a randomly chosen vector (1-st) and the
rest in their respective classes. As one can see from this figure, the angles
concentrate in a vicinity of $\pi/2$, which is consistent with our derivations
for points in $B^n(1)$.
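
The same concentration of angles can be reproduced on synthetic data. The sketch
below is only a stand-in for the HOG experiment: it draws vectors uniformly from
the hypercube $[-1, 1]^n$ with $n = 1920$ instead of using the database, and the
function names and sample size are hypothetical.

```python
import math
import random


def angle(u, v) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    c = max(-1.0, min(1.0, dot / (nu * nv)))  # guard against rounding
    return math.acos(c)


def angles_to_first(n: int, m: int, rng: random.Random):
    """Angles between the first of m random vectors in [-1, 1]^n and the rest."""
    vecs = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
    first = vecs[0]
    return [angle(first, v) for v in vecs[1:]]


if __name__ == "__main__":
    angles = angles_to_first(1920, 100, random.Random(0))
    spread = max(abs(a - math.pi / 2) for a in angles)
    print("largest deviation from pi/2 (radians):", spread)
```

For independent uniform components the cosine of the angle has standard
deviation close to $1/\sqrt{n}$, so for $n = 1920$ the angles stay within a few
hundredths of a radian of $\pi/2$.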

Another interesting effect of measure concentration is exponential growth
of the lengths of chains of randomly chosen vectors which are pairwise almost
orthogonal. In order to illustrate and assess validity of our estimates (28),
(29) the following numerical experiments have been performed. A point is first
randomly selected in a hypercube $[-1, 1]^n$ of some given dimension. The second
point is randomly chosen in the same hypercube. These two points correspond
to two vectors randomly drawn in $[-1, 1]^n$. If the angle between the vectors
is within $\pi/2 \pm 0.037\pi/2$ then the vector is retained. At the next step a
new vector is generated in the same hypercube, and its angles with the
previously generated vectors are evaluated. If these angles are within
$0.037\pi/2$ of $\pi/2$ then the vector is retained. The process is repeated
until the chain of almost orthogonality breaks, and the number of such pairwise
almost orthogonal vectors (length of the chain) is recorded. Results are shown
in Figure 5. The red line corresponds to the conservative theoretical estimate
(28), the green curve shows the refined estimate (29), and the box plot shows
lengths of pairwise almost orthogonal chains as a function of dimension. The
value of $\vartheta$ was set to 0.1 for both theoretical estimates, and our
choice of precision margins $\pi/2 \pm 0.037\pi/2$ corresponds to
$\varepsilon = \cos(0.963\pi/2) = 0.0581$. As we can see from this figure our
empirical observations are well aligned with theoretical predictions.

4 The database has been developed by Apical LTD.

[Figure 4: histograms, panel (a) "Faces" and panel (b); horizontal axis: angles
between the first vector in the sample and the rest.]

Figure 4: Measure concentration in high dimensions. Panel (a) shows histogram of
angles between a randomly chosen feature vector in the set of "faces" and the
rest of the vectors in the class. Panel (b) shows histogram of angles between a
randomly chosen feature vector in the set of "non-faces" (negatives) and the
rest of vectors in this class.

Figure 5: Lengths $N$ of pairwise almost orthogonal chains of vectors that are
independently randomly sampled from $[-1, 1]^n$ as a function of dimension, $n$.
For each $n$, 20 pairwise almost orthogonal chains were constructed numerically.
Boxplots show the second and third quartiles of this data for each $n$, red bars
correspond to the medians, and blue stars indicate means. Red curve shows
theoretical bound (28), and green curve shows refined estimate (29).
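
The chain-construction experiment described above can be sketched as follows.
This is an illustrative reimplementation under our reading of the procedure, not
the code used to produce Figure 5; the function names, the cap `max_len`, and
the dimensions in the usage example are hypothetical.

```python
import math
import random

# Tolerance from the text: angles within pi/2 +/- 0.037*pi/2.
EPS = math.cos(0.963 * math.pi / 2)


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def almost_orthogonal_chain(n: int, eps: float, rng: random.Random,
                            max_len: int = 10000):
    """Draw vectors uniformly from [-1, 1]^n until one fails to be
    eps-orthogonal to every vector already in the chain."""
    chain = [[rng.uniform(-1.0, 1.0) for _ in range(n)]]
    while len(chain) < max_len:
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if all(abs(cosine(v, u)) <= eps for u in chain):
            chain.append(v)
        else:
            break  # the chain of almost orthogonality is broken
    return chain


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (500, 1000, 2000):
        lengths = [len(almost_orthogonal_chain(n, EPS, rng)) for _ in range(20)]
        print(n, sorted(lengths)[len(lengths) // 2])  # upper median of 20 chains
```

Running the loop for increasing $n$ gives chain lengths that can be compared
with the estimates (28) and (29) evaluated at $\vartheta = 0.1$.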

6.3. Approximation of a constant: dimensionality blowup 

Another, and perhaps more interesting, illustration of measure concentration
and orthogonality effects belongs to the field of function approximation. Let us
suppose that we are to approximate a given continuous function defined on an
interval $[0, 1]$ by linear combinations of the following type

$$f_N(x) = \sum_{i=1}^{N} c_i\, \varphi(a_i, \sigma_i, x), \qquad (32)$$

where the function $\varphi$ is defined as follows



For simplicity we suppose that the function to be approximated, $f^*$, is a
constant:

$$f^*(x) = 1 \quad \forall\, x \in [0, 1].$$

Linear combinations of $\varphi(a_i, \sigma_i, \cdot)$ can uniformly approximate
every continuous function on $[0, 1]$. Furthermore, the chosen function $f^*$
can be represented by just a single element with $a = 0.5$, $\sigma = 0.5$:
$f^*(x) = \varphi(0.5, 0.5, x)$. Since we assumed no prior knowledge of these
parameters, we approximated the function $f^*$ with linear combinations (32) in
which the values of $a_i$, $\sigma_i$ were chosen randomly in the interval
$[0, 1]$, and the values of $c_i$ were chosen as follows



(33) 


In order to evaluate performance of approximation as a function of $N$ the
following iterative procedure has been used. On the first step the values of
$a_1$ and $\sigma_1$ are randomly drawn from the interval $[0, 1]$. This is
followed by finding the value of $c_1$ in accordance with (33) (it is clear
though that $c_1 = 1$). Next the values of $a_2$, $\sigma_2$ are drawn from
$[0, 1]$, followed by determination of optimal weights $c_1$, $c_2$. The $L_1$
error of approximation is recorded for each step. The process repeats until
$N = 500$.

Fig. 6 presents 20 different error curves corresponding to different growing
systems of functions $\{\varphi(a_i, \sigma_i, \cdot)\}_{i=1}^{N}$. Despite the
fact that the problem is both a) simple and b) admits explicit solutions with
respect to the values of $c_1, \dots, c_N$, performance of such approximations
in terms of convergence rates is far from ideal. One can observe that initially
there is a significant drop of the approximation error for $N < 100$. After this
value, however, the error decays very slowly and its rate of decay nearly stalls
for $N$ large.

One potential explanation of this effect is as follows. Fig. 7 shows functions
$f^*(x) - f_N(x)$ for $N = 5$, $50$, and $500$ along a single typical curve from
Fig. 6. As we can see from these figures the error functions become increasingly
spiky with $N$. The individual spikes are randomly distributed on $[0, 1]$, and
their thickness is converging to zero. To compensate for such errors one needs
to be able to generate a very narrow $\varphi(a_i, \sigma_i, \cdot)$ which, in
addition, is to be placed in the right location. Effectively, if error reduction
at each step is sought, this is equivalent to dimensionality growth of the
problem at each step. However, in accordance with (28), to overcome the negative
effect of measure concentration, one needs to accumulate an exponentially large
number of samples. This is reflected in very slow convergence at the end of the
process.

Figure 6: Errors of approximation of $f^*(x) = 1$ by linear combinations
$f_N(x)$, (32), as functions of $N$.

7. Conclusion 

In this work we demonstrate that, despite increasing popularity of random
basis function networks in control literature, especially in the domain of
intelligent/adaptive control, one needs to pay special attention to practical
aspects that may affect performance of these systems in applications.

First, as we analyzed in Section 3 and showed in our examples, although
the rate of convergence of the random basis function approximator is
qualitatively similar to that of the greedy approximator, the rate of the random
basis function approximator is statistical. In other words, small approximation
errors are guaranteed here only in probability. This means that, in some
applications such as practical adaptive control in which the RVFL networks are
to model or compensate system uncertainties, employment of a re-initialization
with a supervisory mechanism monitoring quality of the RVFL network is
necessary. Unlike network training methods that adjust both linear and nonlinear
weights of the network, such a mechanism may have to be made robust against
numerical problems (ill-conditioning).

Our conclusion about the random basis function approximators is also consistent
with the following intuition. If the approximating elements (network nodes) are
chosen at random and not subsequently trained, they are usually not placed in
accordance with the density of the input data. Though computationally easier
than for nonlinear parameters, training of linear parameters becomes ineffective
at reducing errors "inherited" from the nonlinear part of the approximator.
Thus, in order to improve effectiveness of the random basis function
approximators one could combine unsupervised placement of network nodes
according to the input data density with subsequent supervised or reinforcement
learning of the values of the linear parameters of the approximator. However,
such a combination of methods is nontrivial because in adaptive control and
modeling one often has to be able to allocate approximation resources
adaptively, and the full network training seems to be the natural way to handle
such adaptation.

Figure 7: Error function $f^*(x) - f_N(x)$ for $N = 5$ (panel (a)), $N = 50$
(panel (b)), and $N = 500$ (panel (c)) for a typical curve from Fig. 6.

Second, we showed that in high dimensions an exponentially large number of
randomly and independently chosen vectors are almost orthogonal with probability
close to one. This implies that in order to represent an element of such a
high-dimensional space by linear combinations of randomly and independently
chosen vectors, it may often be necessary to generate samples of exponentially
large length if we use bounded coefficients in linear combinations. On the other
hand, if coefficients with arbitrarily large values are allowed, the number of
randomly generated elements that are sufficient for approximation is even less
than the dimension. In the latter case, however, we have to pay for such a
significant reduction of the number of elements by ill-conditioning of the
approximation problem.

A simple numerical example that illustrates such behavior has been provided.
Not only does this example demonstrate the effect of measure concentration in a
simple approximation problem, it also highlights practical implications for
other approximation schemes that are based on randomization.